The END of an ERA
REQUIEM for the GLH*
RIP

As the United States entered the First World War, and
Associationism was in its heyday, Otis was asked to prepare a set of easily
administered quick-screening tests for the U.S. Army. We will remember
that Pavlov's experiments were barely out of the news at that time, and
Thorndike was getting cats to do amazing things in his puzzle boxes. It was
natural for Otis to assume that learning involved the establishment of
associations or "mental connections" between a stimulus and a response.

The rest is history. The multiple-choice test, which he adapted,
if not invented, for this purpose proved very successful. A new era in
measurement technology had been launched.

In keeping with good associationist principles, he developed a
scoring procedure in which the frequency of "right" answers was counted.
The logic behind this procedure was that the right answers were assumed,
from their design, to be correct associations. The wrong possibilities
were to be "plausible" but were expected to be chosen by "trial and error"
in the absence of the correct associations. In other words, the respondent
either KNOWS the answer or GUESSES. Since these "guesses" were considered
to be "blind" (the product of trial and error), it was assumed that no
meaningful information would be available from them.

In addition to this, since the average pattern across all wrong
answers was expected to be random, a certain number of the "right" answers
were also expected to represent "lucky guesses." In this case, it would be
impossible to determine which particular right answer was meaningful and
which was not, so that individual item responses could not be interpreted.
Only some sort of accumulation of responses could be useful when
attempting to assess learner status.

* GLH refers to the General Linear Hypothesis.

The result of these assumptions about the way in which the
respondent would behave when taking the test was a two-step
scoring procedure:



STEP ONE: Scoring the items

In this step the procedure was to make a pass through the test
and to compare the respondent's answer with the one keyed in
the predetermined control pattern as being the RIGHT answer.
The respondent's particular answer was then changed to a "one"
(1) for a match, and a "zero" (0) for a mismatch. If a
"correction-for-guessing" were to be used, then "omitted" answers
were left BLANK.



(1)  $x_{ij} \rightarrow \delta_{ij} \in \{1, 0\}$, where $x_{ij}$ is the actual response of the
$i$th individual on the $j$th item and $\delta_{ij}$ is the resulting binary conversion.
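Equation (1) amounts to a lookup against the answer key. The Python sketch below assumes responses are letter codes and that an omission is recorded as `None`. The function name `score_items` and the example key are hypothetical and do not come from the original study.

```python
from typing import List, Optional, Sequence


def score_items(
    responses: Sequence[Optional[str]],
    key: Sequence[str],
    correction_for_guessing: bool = False,
) -> List[Optional[int]]:
    """Step one: convert each raw response x_ij into delta_ij.

    A match with the keyed RIGHT answer becomes 1 and a mismatch becomes 0.
    An omitted answer (None) is left blank (None) only when a correction
    for guessing will follow; otherwise it is scored 0 like a wrong answer.
    """
    if len(responses) != len(key):
        raise ValueError("expected one response per keyed item")
    deltas: List[Optional[int]] = []
    for x, k in zip(responses, key):
        if x is None:
            deltas.append(None if correction_for_guessing else 0)
        else:
            deltas.append(1 if x == k else 0)
    return deltas


# Hypothetical five-item key and one respondent's answers.
key = ["B", "D", "A", "C", "B"]
answers = ["B", "A", "A", None, "C"]
print(score_items(answers, key))                                # [1, 0, 1, 0, 0]
print(score_items(answers, key, correction_for_guessing=True))  # [1, 0, 1, None, 0]
```

After this step, the particular wrong letter a respondent chose ("A" or "C" above) is no longer recorded. Only match or mismatch survives.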

STEP TWO: Scoring the test 

In this step the procedure was to add the vector formed in step
one for each respondent. When a "correction-for-guessing" was used,
it was applied as a third step.

Mathematically:

(2)  $X_i = \sum_{j=1}^{n} \delta_{ij}$, where $n$ is the number of items on the test.
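Equation (2) is a plain row sum. The sketch below uses hypothetical function names and adds the optional third step as the conventional formula score R − W/(k − 1) for k alternatives per item. The paper does not say which correction was used, so that formula is an assumption.

```python
from typing import Optional, Sequence


def total_score(deltas: Sequence[Optional[int]]) -> int:
    """Equation (2): X_i is the sum of delta_ij over the n items (blanks add 0)."""
    return sum(d for d in deltas if d is not None)


def formula_score(deltas: Sequence[Optional[int]], n_alternatives: int) -> float:
    """Optional third step: the conventional correction R - W / (k - 1).

    R counts 1s, W counts 0s, and blank (None) items are ignored; k is the
    number of alternatives per item.
    """
    if n_alternatives < 2:
        raise ValueError("an item needs at least two alternatives")
    right = sum(1 for d in deltas if d == 1)
    wrong = sum(1 for d in deltas if d == 0)
    return right - wrong / (n_alternatives - 1)


deltas = [1, 0, 1, None, 0]       # output of step one, with a blank kept
print(total_score(deltas))        # 2
print(formula_score(deltas, 4))   # 2 - 2/3, about 1.33
```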

This information is very well known, but is repeated here for
several reasons. First, it should be noted that a good translation of the
Associationist model into mathematical terms would conform precisely to
these procedures. Second, it is well known that whenever the GLM is applied
to test data, step one is almost always applied FIRST, i.e. BEFORE any work
with the GLM is attempted. Third, step two is also generally applied
BEFORE the use of the GLM whenever analyses other than item analysis are
contemplated. If there has been a predetermined subtest, then equation (2)
is applied to the scoring of this subtest before further analysis as well.
Fourth, it is also well known, and logically obvious, that if either of
these two scoring procedures can be shown to be invalid, then this
demonstration also INVALIDATES the use of the GLM after these procedures
have been applied.

Finally, if step one is invalid, then the GLH cannot legitimately be
used upon the basic data set, because that set is on a nominal
(categorical) scale, and the assumptions of the GLM require an INTERVAL
scale to be fully functional.

Most psychometricians would agree that the above two and a half
pages are obvious, and as such are probably not worth repeating. I would
agree, except that I now propose to show that BOTH step one AND step two
may be INVALID upon both psychological grounds AND statistical grounds.
I have repeated these "obvious" facts precisely because they are obvious to
the point that we often pay little attention to them, and to stress the fact
that invalidating these two steps is equivalent to invalidating the USE of
the GLM once these scoring procedures have been applied.

Of course, once we have some other procedure of transformation
which validly converts these data into interval scales, the use of the
GLM becomes valid once more. However, if this alternative procedure accounts
for nearly all of the available explainable variance by itself, then the
use of the GLH would be valid, but UNNECESSARY.

It is within this context that I am arguing that we have come to 
the end of an ERA. It is indeed possible that the use of the GLM upon test 
data AFTER these two steps have been applied has yielded so many ambiguous 
and null results simply because these two steps removed so many of the 
available discriminations and so much of the available variance from the 
data set that little was left for the most powerful procedures to find. 
In order to explain why such a possibility was not discovered
before, we need only realize that curvilinear phenomena yield strange
results under linear analysis. It is desirable to avoid curvilinear
problems if possible since, unlike with the GLH, a general solution for
these problems does not exist. The few attempts which were made to include
all answers have all been "one shot" attempts which have not clearly
pointed in any direction. It is quite reasonable to find people abandoning
such attempts with more pressing problems at hand.

I have an unusual advantage over most researchers in this respect,
because as a secondary-school classroom teacher I used test analysis on my
tests for years before becoming a researcher. I therefore already knew,
from practical experience, that wrong answers had diagnostic value long
before I became interested in the statistical and other psychological
properties of this part of the data set. As a result, when the diagnostic
properties did not jump out at me from my linear analyses of these segments
of the distribution, I had the experience-based motivation to persist.

In Search of
the Esoteric

The term "esoteric" means "hidden," and the properties of
answer selection distributions which I have been seeking have certainly been
well hidden. At first glance, it would appear that I have been trying to
process the "noise" in the system. In fact, this possibility has been
raised upon several occasions. My rejoinder has been that it is not a
property of NOISE to show consistent patterns with different tests and
different age groups across what is now approaching a dozen studies.
What is even more interesting is the fact that although my findings
seem paradoxical to the researcher, inclining him or her to discount these
results, these findings do not seem so strange to the practitioner. In
brief, I have found that "wrong" answers, those strange little fellows we
have been turning into "nothing" these many years, are actually more
meaningful than the "right" answers we have been using to determine the
achievement status of learners!

How I finally came to this paradoxical conclusion is too long a
story for this paper. I propose, therefore, to try to capture the flavour
of these events by extending the study I reported to the NCME in Boston
last month (April 1980).

To begin with, I did not get clearly definitive results
until I returned to the disaggregated basic data patterns. It was only
from the cross-tabulated contingency tables of the relationships between
item pairs and single-item repetitions that the underlying properties of
these distributions became clear. Earlier efforts using wrong answers as
dummy variables in multiple regression equations showed the superiority of
wrong answers, but did not show the SAME wrong answers to be more meaningful
in the different studies using a variety of age groups. Not only were
wrong answers consistently better, but they were also consistently
inconsistent. Shakespeare's Mercutio cried, "A plague o' both your houses!",
which was precisely the way I was feeling until I hit upon the use of the
contingency tables.

These tables present a serious problem for interpretation,
however. We can easily determine whether or not a table is homogeneous
from the size of its aggregate χ², for which we determine the expected
values from the marginal proportions. A non-homogeneous table would
reflect non-linearity. If a data set contains many more such tables than
would be expected by chance, then the data itself is non-linear and
inappropriate for use with the GLM.
However, where do we go after establishing non-homogeneity? If the
non-homogeneity is not consistently in particular rows or columns, then we
are in very real difficulties. The only other place for this departure from
linearity to occur is within the cells of the table themselves, and the
critical values for individual cell χ² values are indeterminate because of
a zero in the denominator of the equation.
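The homogeneity check described above builds expected values from the marginals, E = (row total × column total) / N, and sums (O − E)²/E over the cells. The per-cell terms are the "cell χ²" values interpreted in the rest of this paper. Below is a minimal Python sketch, with hypothetical function names.

```python
from typing import List, Sequence

Table = Sequence[Sequence[float]]


def expected_counts(table: Table) -> List[List[float]]:
    """Expected frequencies from the marginal proportions: E = R * C / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def cell_chi_squares(table: Table) -> List[List[float]]:
    """Per-cell contributions (O - E)^2 / E; a cell with E == 0 contributes 0."""
    expected = expected_counts(table)
    return [
        [(o - e) ** 2 / e if e > 0 else 0.0 for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


def aggregate_chi_square(table: Table) -> float:
    """The table's overall chi-square, compared against (r - 1)(c - 1) df."""
    return sum(sum(row) for row in cell_chi_squares(table))


# Hypothetical 2 x 2 example: every expected value is 12.5.
demo = [[20, 5], [5, 20]]
print(cell_chi_squares(demo))      # [[4.5, 4.5], [4.5, 4.5]]
print(aggregate_chi_square(demo))  # 18.0
```

The aggregate value settles homogeneity for the whole table. The cell values locate where a departure lies, but only once critical values for them are available.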

To get around this problem, my colleagues and I (Powell, Shklov
and Rahim, 1980) developed an alternative approach. Using the same
assumptions as Otis for the generation of a data set, we simulated the data
we were using in these current studies. The regression line for each item
was assumed to be the trend for selection for that item, and the standard
deviation for the scatter of observed points about that line was assumed to
be the measurement error in that item. The regression value for an age
level, plus a random value distributed as error,* was set as the probability
that this answer would be "right." Wrong answers were distributed as
equiprobable across the difference between this probability and one. A
second random number, rectangularly distributed between zero and one,
determined the simulated answer. A data set which duplicated the original in
age-group-level frequencies was then generated, and the contingency tables
for the simulation were struck. The frequency distribution for averages of
the frequencies of cell values was determined, and the cumulative
proportions for the final frequency polygon were used as the critical values
for cell χ². A Monte Carlo approach to verify these values has been
conducted. These values are reported in the work cited, but the more
important ones are: p = .10, χ² = 1; p = .05, χ² = 2.1; and p = .01,
χ² = 3.8. These values also closely approximate the extrapolated values we
could obtain from standard tables.

* These random numbers were normally distributed with a mean of zero and a
standard deviation of the assumed $s_e$.
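The generating rule for one simulated answer can be sketched in Python as follows. This is a reconstruction under assumptions: the function name, the clamping of the probability to [0, 1], and the example trend value and error are all hypothetical, and the original program is not reproduced here.

```python
import random
from typing import Sequence


def simulate_answer(
    trend_value: float,
    s_e: float,
    alternatives: Sequence[str],
    right: str,
    rng: random.Random,
) -> str:
    """Simulate one respondent's answer to one item at one age level.

    trend_value: the item's regression (trend) value at this age level.
    s_e: the item's measurement error (SD of the scatter about the trend).
    """
    # Probability of the right answer: trend value plus normal error with
    # mean 0 and SD s_e, clamped to [0, 1] (the clamp is an assumption).
    p_right = min(1.0, max(0.0, trend_value + rng.gauss(0.0, s_e)))
    wrong = [a for a in alternatives if a != right]
    # The wrong answers share the remaining probability 1 - p_right equally.
    p_each_wrong = (1.0 - p_right) / len(wrong)
    # A second, rectangularly distributed number in [0, 1) picks the answer.
    u = rng.random()
    if u < p_right:
        return right
    index = int((u - p_right) / p_each_wrong)
    return wrong[min(index, len(wrong) - 1)]  # guard against rounding at 1.0


# Hypothetical item: keyed answer "B", trend .60, measurement error .10.
rng = random.Random(1980)
answers = [simulate_answer(0.60, 0.10, "ABCD", "B", rng) for _ in range(1000)]
print({a: answers.count(a) for a in "ABCD"})
```

Data generated this way obey the know-or-guess model by construction. Any structure in their contingency tables is therefore chance structure, which is what makes them usable as a null reference.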

From this combination of simulation and Monte Carlo procedures we
generated a useful new tool for the interpretation of the cells in a
contingency table. We could now look for meaning where it had previously
been unavailable.
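The step from simulated tables to critical values reads the cut-off from the cumulative proportions of the null distribution of cell χ², and can be sketched as below. For brevity, the sketch draws the two answers in each null table independently from fixed, hypothetical marginal proportions rather than running the full item-by-age simulation. It therefore illustrates the technique only, and will not reproduce the .10, .05 and .01 values quoted above. All names are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def cell_chi_square_values(table: Sequence[Sequence[int]]) -> List[float]:
    """All (O - E)^2 / E values of a table, with E from its own marginals."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    values = []
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            e = r * c / n
            if e > 0:
                values.append((table[i][j] - e) ** 2 / e)
    return values


def null_table(n: int, probs_a: Sequence[float], probs_b: Sequence[float],
               rng: random.Random) -> List[List[int]]:
    """Cross-tabulate n pairs of answers drawn independently of each other."""
    k = len(probs_a)
    table = [[0] * k for _ in range(k)]
    first = rng.choices(range(k), weights=probs_a, k=n)
    second = rng.choices(range(k), weights=probs_b, k=n)
    for i, j in zip(first, second):
        table[i][j] += 1
    return table


def empirical_critical_values(levels: Sequence[float], n: int = 100,
                              reps: int = 1000, seed: int = 1980) -> Dict[float, float]:
    """Cell chi-square exceeded by a proportion p of null cells, for each p."""
    rng = random.Random(seed)
    probs_a = [0.55, 0.15, 0.15, 0.15]  # hypothetical marginal proportions
    probs_b = [0.40, 0.20, 0.20, 0.20]
    values: List[float] = []
    for _ in range(reps):
        values.extend(cell_chi_square_values(null_table(n, probs_a, probs_b, rng)))
    values.sort()
    # Cumulative proportions: the value at the (1 - p) quantile.
    return {p: values[min(len(values) - 1, int((1 - p) * len(values)))] for p in levels}


print(empirical_critical_values([0.10, 0.05, 0.01]))
```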

Comparisons by 
the Million 

To give you some idea of the scale of these studies, I gave a 40-item
test in reading comprehension (The Proverbs Test, Gorham, 1956) to
several thousand students in the age range from 7 years to 20+ years. The
test was administered twice with a 5 month gap between administrations.
This procedure netted me nearly 3000 students who had taken the test twice.
By dividing the age range into intervals of 5 months on each child's age
in months, I obtained 30 age levels with an average of about 100 in each
group. The 5 month grouping had two purposes. First, no matter how I
blocked or combined these subjects, I would not get duplicate
representation in any grouping. Second, the 5 month interval represented
half of a 10 month school year.

Table 1 gives the breakdown of the subgrouping of the sample.



INSERT TABLE 1 ABOUT HERE 



The sample presents a representative selection of the schools in
an urban industrial city in the mid-West of about a quarter million population.
With 30 age levels, 2 administrations, 40 items, and two
other tests with which to make comparisons, the resulting contingency tables
will generate millions of cells to examine. As a result, only a small
inroad into the properties of these data has been made since 1977-78,
when they were collected.

As already indicated, this large-scale study was mounted because
earlier indications of the value of wrong answers, and of the possibility of
an underlying curvilinear pattern in these data distributions, warranted
trying to get enough data points to plot some patterns.

Thus far there have been three major cuts into the mass. The
first one (Powell, 1978) reported that replication was found between this
sample and a previous one from 1975 (with 550 subjects). The
replication, though not yet complete enough to be unequivocal, was strong
enough to suggest that what is being reported here represents general
properties of the test as well as the specific properties of this particular
sample.

The second study (Powell, 1979) considered the interactions among
the first 5 items on this test across the age range. Considering a
"meaningful" interaction to be one with a cell χ² greater than 2.3, two
types of interaction were expected. This minimum was conservatively based
upon extrapolation made before the simulation study. The two types of
interaction were: when the observed frequency meaningfully exceeded the
expected frequency, and the reverse of this relationship. Where O > E, an
event similar to a positive correlation was assumed and the interaction was
considered to be "joint." The reverse was "mutually exclusive."

Making the same assumptions made by Otis, several hypotheses could
be considered. Right answer by right answer interactions should balance
toward the joint type (which they did: 60 to 0 out of 600 possible), should
be more frequent than 5% of the events (they were 10%), and should increase
with age as less guessing occurred (they DECREASED with age). The right
by wrong answer interactions should be exclusive rather than joint, and
otherwise show the same patterns as for R × R. They were found to be
mutually exclusive (12 to 220 out of 3600), represented 6.44% of the events,
and did NOT change with age. Wrong by wrong interactions, taken as random
events, should show about equal numbers of joint and exclusive events
(joint events were FAVOURED 633 to 39 out of 5400), should be less than 5%
of the possible (they were 12.44%), and should show no age pattern (they
INCREASED in frequency with age).
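The percentages above can be checked from the counts. The denominators of 600, 3600 and 5400 are consistent with 10 pairs among the first 5 items, 30 age levels, 2 administrations, and items with one right and three wrong alternatives. That breakdown is our reconstruction rather than something the text states.

```python
from math import comb

# Reconstructed denominators (an assumption): pairs of the first 5 items,
# at 30 age levels, on 2 administrations; 1 right and 3 wrong alternatives.
tables = comb(5, 2) * 30 * 2              # 600 item-pair tables
possible = {
    "R x R": tables * 1 * 1,              # right-by-right cells per table
    "R x W": tables * (1 * 3 + 3 * 1),    # right-by-wrong, in either order
    "W x W": tables * 3 * 3,              # wrong-by-wrong cells
}
# Meaningful cells reported in the text (joint + mutually exclusive).
meaningful = {"R x R": 60 + 0, "R x W": 12 + 220, "W x W": 633 + 39}

for kind, total in possible.items():
    share = 100 * meaningful[kind] / total
    print(f"{kind}: {meaningful[kind]} of {total} = {share:.2f}%")
# R x R: 60 of 600 = 10.00%
# R x W: 232 of 3600 = 6.44%
# W x W: 672 of 5400 = 12.44%
```

Only the wrong-by-wrong cells exceed both the 5% chance level and the right-by-right rate.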

As we can see from these paradoxical results, the events related
to the right answers generally support these hypotheses, but the ones
related to the wrong answers completely REFUTE them. Clearly, wrong answers
show systematic properties beyond the level of being "noise" in the system.
It is equally clear that an approach to this problem which considered only
the right answer relationships would not reveal this fact. A researcher
would probably explain this DECREASE in R × R interactions as a consequence
of the increase in the frequency of right answer SELECTION, and leave it
at that. With the joint W × W pattern exceeding the R × R interactions by
more than 10 to 1 (633 to 60), and the R × W events near the chance
level, we can begin to see both how much is being lost when we convert the
wrong answers to zeroes and why this has been missed in other studies.

It is now clear that the way to derive the distribution properties
of answer selection is to use a large-sample, segmented, cross-sectional
contingency table analysis. The crucial part of this analysis involves
some cell-by-cell interpretation and comparison, for which obtaining
the critical values for cell χ² seems to be essential.



Before going to the third study (Powell, 1980), of which this one
is an extension, we should turn briefly to the psychological properties of
answer selection.

Paradoxes 

Paradoxes 

Paradoxes 
araaoxes 

The first analytic study I conducted into the area of wrong answer
selection (Powell, 1968) involved the same test being used here, except
with college juniors and seniors. The right answers showed a strong single
factor with good separation using principal components analysis, so that I
was certain that I was using a "good" test before I began. In addition to
collecting answers, I asked these students to explain their selections in
a separate booklet. I used their reasoning to classify the four wrong
answer factors, which showed simple structure and which I studied further.
For the most part, these wrong answer selections reflected reasoning errors,
such as linking only part of the proverb to a translation of it
(oversimplification). The general quality of the reasoning from one group
effectively described the reasoning within all members of a factor in another
class of students about two thirds of the time. As a result, the
diagnostic use of wrong answers I had witnessed as a Math/Science teacher
in secondary school seemed to be present at the college level in this
language-based test, but not as strongly as expected.

The surprising finding was that for one type of "wrong" answer
(Irrelevancies, defined as true statements unrelated to the problem) I
found clear evidence of multiple reasons for the selection of the same
answer. Those students in the middle of the range by total-correct score
indicated that their choice was based upon the truth of the statement. There
was also a smaller group, who scored in the 80th percentile range, who
interpreted the question differently, but legitimately, and chose this
so-called Irrelevancy upon a basis which was logically CORRECT once we admit
the appropriateness of their alternative interpretation. Here was evidence
that some "wrong" answers should be considered RIGHT, and that they
identified those who had "over-read" the question.

To probe this issue further, I collected the reasoning, using
trained interviewers, from about two thirds of the 550 children (ages 8 to
16) to whom I gave this same test in May of 1975. I used a non-parametric
procedure to develop a set of twelve wrong answer subtests, and then
classified these using the reasoning protocols. This classification was
supported by between 50% and 66% of the reasoning reports for all 12
subtests. This test, by the way, has two "right" answer scores: one for
"concrete" and the more usual one for "abstract" answers. As such, I hoped
that the transition between concrete and abstract reasoning in children's
verbal recognition patterns might be illuminated.

I used a simplex approach to ordering these subtests; that is,
the more closely related subtests were to each other, the nearer they were
in the resulting sequence. The big surprise was "no surprise." All 12
subtests arranged themselves into an order which reproduced the age sequence
without exception. This is the closest I have come to seeing a perfect
correlation in all of my years of working with live data. A very strong
developmental influence seemed to be present among these data. No stretch
of the imagination could conjure this observation as "noise" in the system.
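The simplex procedure itself is not given here. As a stand-in, a greedy nearest-neighbour seriation captures the idea that closely related subtests should sit next to each other. It starts at an end of the chain (the subtest least related to the rest) and repeatedly appends the most similar remaining subtest. This is a sketch of seriation in general, not a reproduction of the reported analysis. The function name and the similarity values in the example are hypothetical.

```python
from typing import List, Sequence


def greedy_seriation(similarity: Sequence[Sequence[float]]) -> List[int]:
    """Order objects so that similar ones are adjacent (greedy chain)."""
    n = len(similarity)
    # An end of the chain: the object with the smallest total similarity.
    start = min(range(n), key=lambda i: sum(similarity[i][j] for j in range(n) if j != i))
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        last = order[-1]
        nearest = max(remaining, key=lambda j: similarity[last][j])
        order.append(nearest)
        remaining.remove(nearest)
    return order


# Hypothetical perfect simplex: similarity falls off with distance along a
# hidden sequence, and the subtests are presented out of order.
hidden_position = [2, 0, 3, 1]
sim = [[1 / (1 + abs(p - q)) for q in hidden_position] for p in hidden_position]
print(greedy_seriation(sim))  # recovers [1, 3, 0, 2] or its reverse
```

A chain recovered this way has no built-in direction. Reading it as an age sequence requires an external anchor such as the subtests' mean ages.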

When I looked at the interpretations, a sequence strongly
reminiscent of Piaget's accounts was present. Many of these wrong answers
displayed CORRECT REASONING once we took into account the DEVELOPMENTAL
PERSPECTIVES in the cognition of these learners. These answers were
NOT WRONG when the alternative world-view which characterizes their
developmental stage is taken into account. Figure 1 below illustrates this point.



INSERT FIGURE 1 ABOUT HERE



Of the three influences (interpretation, procedure and information
content), the order of precedence seemed to be as just given.

The eight-year-olds who chose alternative "C" seem to be
interpreting the proverb (Quickly come, quickly go) in terms of their own
physical movement about the classroom. Their typical reasoning ("that's
what the teacher always says") links the concept of "being quick about it"
quite accurately to the answer they selected (Always do things on time).
They have apparently not yet decentered and are still interpreting such
statements in terms of their personal experiences. Once we know what is
happening, the connection between the choice of answer and the reasoning
behind it becomes quite clear. These linkages are NOT "trial and error"
in any sense, random or not, but come directly from the emerging logic of
the learner.

I could continue to make the same point with each of the other
"wrong" answers, but to do so would be redundant, and it has already been
done elsewhere (Powell, 1977). These observations, when contrasted with the
proposal attributed to Association Theory at the beginning of this paper
(learning involves forming "correct" associations, and in their absence
"trial and error" will be observed), leave little doubt about the
psychological invalidity of that theory. QED.

It would not be unreasonable to expect to find the mathematical
translation of a behavioural theory to be behaviourally invalid if the
theory itself is shown to be invalid. Hence the disappearing Paradoxes in
the title of this section.

We can now return to the mainstream of our discussion about the
distribution properties, knowing that an effective analytic procedure should
expose sequences of answers rather than merely a transfer to "right"
answers, and that along with this there should be some sort of
emerge-decline pattern of curved-line events to account for these
transitions. We also need to find some reason why these patterns have not
been more clearly evident before this.

The Silent
Jiggle

The third study in this series (Powell, 1980) also proved to be
the most profitable to date. In this one I cross-tabulated the within-item
pre-test and post-test events for all of the learners who had taken the test
twice. My purpose was to try to determine the patterns of change which
occurred in answer selection. In this case an O > E event would be a stable
selection, with more people than expected giving the same answer upon both
occasions (for the events in the principal diagonal), and an unstable one
for the reverse relationship. The change events would be shown by the
patterns in the off-diagonal cells.
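The labelling rule for the pre/post tables can be written down directly. A diagonal cell whose cell χ² passes the critical value is stable if O > E and unstable if O < E. An off-diagonal cell is a change if O > E and an avoidance if O < E. The Python sketch below assumes expected values from the table's marginals and uses the p = .05 value of 2.1 quoted earlier as a default. The function name and example table are hypothetical.

```python
from typing import List, Sequence, Tuple


def classify_prepost(
    table: Sequence[Sequence[int]], labels: str, critical: float = 2.1
) -> List[Tuple[str, str, str, float]]:
    """Label significant cells of a pre-test (rows) by post-test (columns) table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    events = []
    for i, pre in enumerate(labels):
        for j, post in enumerate(labels):
            e = rows[i] * cols[j] / n
            if e == 0:
                continue
            o = table[i][j]
            chi = (o - e) ** 2 / e
            if chi <= critical:
                continue  # homogeneous: the marginals account for this cell
            if i == j:
                kind = "stable" if o > e else "unstable"
            else:
                kind = "change" if o > e else "avoidance"
            events.append((pre, post, kind, round(chi, 2)))
    return events


# Hypothetical table dominated by repeated choices.
demo = [[20, 2, 2, 2], [2, 20, 2, 2], [2, 2, 20, 2], [2, 2, 2, 20]]
for event in classify_prepost(demo, "ABCD"):
    print(event)
```

In this toy table every off-diagonal event comes out as an avoidance, produced purely by the heavy diagonal. This mirrors the observation below that most significant off-diagonal events were avoidances rather than changes.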

The major findings were that stability overwhelmed change, and that
the wrong answers were significantly more stable than the right ones, with
this stability increasing with age for the wrong answers and decreasing for
the right ones. This observation suggests that an important characteristic
of development is increased diversity.

However, I am a bit ahead of myself. Before we start to consider
the patterns among the cells, we need to satisfy ourselves that there is
enough curvilinearity among these data to trouble about. Since about two
thirds of the 1200 contingency tables were NOT homogeneous, this issue
was dispensed with quickly.

The balance of our discussion will center upon Item 18, since it
was this item which I reported in Figure 1 (page 12), so that we already
know the expected order to be found from the psychological sequence
(C, D, A, B*).

Because stability overwhelmed the off-diagonals, I will deal with
these issues first. Among the 1200 tables to be considered there were 4800
diagonal cells, of which 1200 were for repeated choice of the right answer
and 3600 were for repeated choice of a wrong answer. Of these 4800, more
than 40% (or 1965) had values which exceeded 2.0. Only one of these was a
significantly unstable event. Of the stable events, the stability of the
wrong answers exceeded that of the right answers by a factor of about 3.5
to 1 (1527 to 437; difference between proportions z = 3.75), and as a
result most of the off-diagonal significant events had O < E, as avoidances
rather than changes.
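The difference-between-proportions test reported here compares 1527 stable cells out of 3600 wrong-answer diagonal cells with 437 out of 1200 right-answer ones. The short Python check below assumes the usual two-proportion z statistic, since the exact variant used is not stated. The function name is hypothetical.

```python
from math import sqrt


def two_proportion_z(x1: int, n1: int, x2: int, n2: int, pooled: bool = True) -> float:
    """z statistic for the difference between proportions x1/n1 and x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    if pooled:
        p = (x1 + x2) / (n1 + n2)
        se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    else:
        se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se


# Stable diagonal cells: wrong answers 1527 of 3600, right answers 437 of 1200.
print(round(two_proportion_z(1527, 3600, 437, 1200), 2))
print(round(two_proportion_z(1527, 3600, 437, 1200, pooled=False), 2))
```

Both versions give values of roughly 3.7, in line with the reported z = 3.75. Either way the difference lies well beyond the conventional .01 level.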

If we look only at Item 18, as shown in Figure 2, the pattern of 
stability becomes clear. 



INSERT FIGURE 2 ABOUT HERE 



The vertical arrangement in Figure 2 is the "psychological" order
given above; the number of horizontal lines indicates the level of
significance, with one line for p ≤ .10, two for p ≤ .05 and three for
p ≤ .01. Thus we can consider this horizontal density to reflect degrees of
stability from "stable" to "extremely stable." Values below these levels are
homogeneous and indicate that the marginal frequencies are sufficient to
account for the choices, and that the frequency of repeated choice on the
pre-test should be considered to be independent of this same frequency on
the post-test.

If we consider only the "wrong" answers (the lower three) for the
moment, there is a visually evident progression of the density of stability
from left to right (with increasing age). The persistence of alternative
"C" is a bit surprising, as is the long extremely stable period for
alternative "D". It should be mentioned that although the age scale is in
years, these data actually represent 5 month age blocks; hence the length of
the representative coding is inconsistent with the age scale.

Considering the right answers, Item 18 is unusual, since for the majority
of items the right answers have their stability to the far left. The
extremely stable section from about age 15 on would, most commonly, not be
there. It is also evident that there are not very many sections of the age
range without at least some stability. It is clear that the internal
dynamics of this item seem to be, in general, more meaningful than the
marginal (aggregate) frequencies. Later on we will see that the marginals
actually supply different information than do these internal dynamics.

In order to overcome the overwhelming influence of the stability
factor, I used a procedure which may prove to be equivalent to the
procedure in factor analysis which removes the first factor's influence in
order to find the second one. I simply dropped the diagonal frequencies
and recalculated the χ² values on the assumption of homogeneous diagonal
elements, as shown in Figure 3.



INSERT FIGURE 3 ABOUT HERE 



In this Figure I show the original print-out table, and the
"doctored" table below it, to show the impact of this procedure. Notice that
the off-diagonal frequencies are identical on both tables but that the
locations of significant events have changed considerably. With the actual
diagonal frequencies in place, two of the three significant events are
avoidances and are in the "row/column" relationships to the most highly
stable event. This observation suggested to me that the stability was
"overwhelming" the change pattern. With the diagonals removed, reducing the
total frequency from 96 to 63, the three significant elements are all
changes. The procedure achieved its objective, but the impact of the
violation of the assumptions for calculations is uncertain at this time.*
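Dropping the diagonal and refitting the expected values is close to what the contingency-table literature calls a quasi-independence model for an incomplete table, with the diagonal cells treated as structural zeros. One standard way to fit it is iterative proportional fitting (IPF), sketched below in Python. Whether this matches the recalculation actually used for Figure 3 is an assumption, and the function names and example table are hypothetical. For a k × k table with the diagonal removed (k ≥ 3), this model has (k − 1)² − k degrees of freedom, which is 5 for a 4 × 4 table.

```python
from typing import Dict, List, Sequence, Tuple


def quasi_independence_fit(
    table: Sequence[Sequence[float]], max_iter: int = 1000, tol: float = 1e-10
) -> List[List[float]]:
    """Expected counts for a square table whose diagonal is excluded.

    Diagonal cells are structural zeros; the off-diagonal expected values
    are fitted by iterative proportional fitting (IPF) so that they match
    the off-diagonal row and column totals.
    """
    k = len(table)
    off = [[0.0 if i == j else float(table[i][j]) for j in range(k)] for i in range(k)]
    row_totals = [sum(r) for r in off]
    col_totals = [sum(c) for c in zip(*off)]
    fit = [[0.0 if i == j else 1.0 for j in range(k)] for i in range(k)]
    for _ in range(max_iter):
        for i in range(k):  # scale rows to their totals
            s = sum(fit[i])
            if s > 0:
                fit[i] = [v * row_totals[i] / s for v in fit[i]]
        for j in range(k):  # scale columns to their totals
            s = sum(fit[i][j] for i in range(k))
            if s > 0:
                for i in range(k):
                    fit[i][j] *= col_totals[j] / s
        if max(abs(sum(fit[i]) - row_totals[i]) for i in range(k)) < tol:
            break
    return fit


def off_diagonal_cell_chi_squares(
    table: Sequence[Sequence[float]], fit: Sequence[Sequence[float]]
) -> Dict[Tuple[int, int], float]:
    """(O - E)^2 / E for each off-diagonal cell with a positive fitted E."""
    k = len(table)
    return {(i, j): (table[i][j] - fit[i][j]) ** 2 / fit[i][j]
            for i in range(k) for j in range(k) if i != j and fit[i][j] > 0}


# Hypothetical 4 x 4 pre/post table with a heavy diagonal.
demo = [[30, 6, 2, 1], [3, 25, 9, 2], [1, 2, 18, 7], [2, 1, 3, 12]]
fit = quasi_independence_fit(demo)
chis = off_diagonal_cell_chi_squares(demo, fit)
print({cell: round(v, 2) for cell, v in chis.items() if v > 2.1})
```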

The fact that this second-order contingency table achieved the
purpose for which the procedure was designed may have been fortuitous, but
it made me bold enough to try to use it for a third-order level of analysis.
I collected all of the changes which emerged from the 30 tables in Item 18
and arranged these frequencies into a "doctored" 4 × 4 table like the one
just discussed. The resulting significant "changes" from this third-order
analysis recovered the psychological sequence which we had already seen
from the logic of the reasoning and now have from statistical procedures;
the full details are elsewhere (Powell, 1980) so need not be repeated here
(both ERIC and the Library of Congress have copies).

* There seems to be no mathematical problem with the cell χ², since this is
merely an alternative model for the "expected" values. However, for the
overall χ² this procedure forms an "incomplete" model which creates
problems in the determination of the number of degrees of freedom.

For this present study, I have taken this analysis one step
farther. I found the average for the linkage points for the changes which
were shown to be meaningful. From this information I prepared Figure 4.



INSERT FIGURE 4 ABOUT HERE



The number of lines in each arrow* follows the same code for the
strength of change in Figure 4 as was used for stability in Figure 2 (page 14).
The vertical arrangement of responses was obtained from the average of the
changes to that choice. It is identical to the psychological order from
Figure 1. It seems to follow an accelerating upward pattern like the bottom
of a growth curve. Each point in this curve seems to be associated with a
period of stability of response selection.

The downward arrows are also interesting. The shift to "D"
predates the reverse trend. Perhaps "D" is more powerful at this age.
The very strong trend from "D" to "C" starts a period of stability
(interrupted once in 26 months) which is then followed by a 20 month gap.
This gap, which begins at age 16, coincides with the youngest legal
school-leaving age in this system. The return to "egocentricity" ahead of
early school leaving is an intriguing possibility which makes intuitive
sense. If this is supported when I process the "school-leaving" data, it
would suggest that these young people may be sending out clear signals
that they are "high risk" several, if not many, months ahead of time.
Since this particular pattern involves a change from one "wrong" answer to
another, using "zeroes" obliterates it. This information is clearly NOT
AVAILABLE when only the "right" answers are considered.

* Two arrows are not shown in this Figure. Although significant, they
represented "avoidances" (O < E), not changes. These were "C/B" and "A/C."

The downward movement from the "right" answer at the top would
also not be anticipated from the Associationist model, but if it represents
those who have progressed so far that they are beginning to "over-read"
some questions, then this downward vector may actually be a reflection
off the ceiling of the test and may represent an upward continuation of
development. When "right" answers alone are being considered, this change
results in a LOWERING of the learner's SCORE.

I did not try to put these two diagrams completely together, since
the visual complexity would have reduced the impact of these observations.
It is quite clear that "wrong" answers seem to contain a good deal of
developmental information which is not available from the "right" answers.
These curved-line patterns, with reversing directions, suggest that
development may follow multiple pathways. With this much complexity, it is
not surprising that trying to reach into it from the one-sided direction of
the "right" answers has not proven to be very successful.

The last straw with respect to the Associationist "know-guess" hypothesis comes from the next pair of observations.



In Rand 
with GOD 

Having looked at the statistically meaningful change patterns within these tables, I decided to explore two other sources of change patterns which were available from these data. One of them was the changes in the marginal totals for each alternative on the pre- to post-test transitions. The other was the changes in these same values from one 5 month age block to the next. Finding very strong support for this complexity in the comparisons between these two kinds of change would further invalidate current practice.
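The two kinds of marginal change described above can be sketched as follows. This is a hypothetical illustration: it assumes each age block is summarized by a pre/post table with post-test choices in the rows and pre-test choices in the columns, and it assumes (from the title of Figure 6) that the between-group comparison uses the average of the pre- and post-test marginal proportions. The function names and counts are not from the study.

```python
# Hypothetical sketch of within-group and between-group marginal changes.

def marginal_proportions(table):
    """Pre-test (column) and post-test (row) proportions per alternative."""
    n = sum(sum(row) for row in table)
    pre = [sum(col) / n for col in zip(*table)]
    post = [sum(row) / n for row in table]
    return pre, post


def within_group_change(table):
    """Pre- to post-test change in each alternative's marginal proportion."""
    pre, post = marginal_proportions(table)
    return [b - a for a, b in zip(pre, post)]


def average_marginals(table):
    """Mean of the pre- and post-test proportions (assumed averaging)."""
    pre, post = marginal_proportions(table)
    return [(a + b) / 2 for a, b in zip(pre, post)]


def between_group_change(table_k, table_next):
    """Change in average marginal proportions from one age block to the next."""
    return [b - a for a, b in zip(average_marginals(table_k),
                                  average_marginals(table_next))]


# Hypothetical two-alternative tables for two consecutive 5 month blocks.
block_1 = [[6, 2], [4, 8]]
block_2 = [[10, 2], [2, 6]]
print(within_group_change(block_1))
print(between_group_change(block_1, block_2))
```

Because each age block in a cross-section contains different learners, the two change vectors need not agree in direction, which is exactly what the comparison is meant to probe.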

Figure 5 gives a schematic version of the first of these two change 
patterns, showing the general pattern with small irregularities removed, 
and Figure 6 does the same for the other pattern. 



INSERT FIGURE 5 ABOUT HERE 



These linear pathways were derived from an end-to-end set of vector drawings of the within-group pre- to post-test aggregate changes. The actual pattern is somewhat more irregular, since I have attempted to capture only the major trends here. (See Powell, 1980, for the actual results.)

The right answers appear to follow a "step function," with a spurt in the eleventh and the fourteenth years. The one in the eleventh year has been noticed by several practitioners of my acquaintance, although I have not seen it discussed in the literature. The one at fourteen may be associated with the transition from concrete to formal operations discussed extensively by Piaget and others. The relationship between the increase in right answers at age 11 and the decline in alternative "C" is puzzling, because the "C" to "B" transformation is one of the two "avoidance" events.*

* The other one was "A" to "C".




The up-turns in "A" and in one other alternative at the extreme right are also of note. One of these is probably related to the cascade effect from the right answers already noticed. The timing is off, however, for the "D" to "C" change: within-group aggregate patterns do not seem to coincide as closely with the internal item dynamics as we might expect.

When we turn to the between-group dynamics, yet another picture 
emerges. This pattern is shown in Figure 6. 



INSERT FIGURE 6 ABOUT HERE 



The surge-plateau pattern characteristic of the within-group dynamics has gone, to be replaced by the patterns found among the 1975 data as well. (See Yu, 1977.)

Comparison on an event-by-event basis has thus shown that the within-group dynamics are significantly DIFFERENT from the between-group dynamics (sign test for inter-point direction equivalence; z = -2.60). The progression seems to run from cell-by-cell to within-group to between-group dynamics, which gives a clue to interpretation. The marginal changes for the within-group are actually a composite of three sub-groups (those who stayed, those who arrived, and those who departed). However, since the marginal proportions are REMOVED in the homogeneity comparisons for the cell-by-cell dynamics, it is partly coincidental when the marginal and the internal changes occur in the same direction.
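The sign test cited above is not spelled out in the text. One plausible reading, sketched here under that assumption, compares the direction of change (up or down) between consecutive age blocks in two series and asks whether the directions agree more or less often than chance; a negative z means they agree less often than expected. The normal approximation without a continuity correction, the function names and the example series are assumptions.

```python
import math


def directions(series):
    """+1 (up), -1 (down) or 0 (flat) for each step between consecutive points."""
    return [(b > a) - (b < a) for a, b in zip(series, series[1:])]


def direction_sign_test(series_a, series_b):
    """Normal-approximation sign test of direction agreement between two
    series. Flat steps in either series are dropped as ties."""
    pairs = [(x, y) for x, y in zip(directions(series_a), directions(series_b))
             if x != 0 and y != 0]
    n = len(pairs)
    if n == 0:
        raise ValueError("no untied steps to compare")
    agree = sum(1 for x, y in pairs if x == y)
    # Under H0 each step agrees with probability 1/2: mean n/2, variance n/4.
    z = (agree - n / 2) / math.sqrt(n / 4)
    return agree, n, z


# Hypothetical marginal proportions for one alternative over 11 age blocks.
within = [0.20, 0.22, 0.25, 0.24, 0.28, 0.30, 0.29, 0.33, 0.35, 0.34, 0.38]
between = [0.30, 0.28, 0.27, 0.29, 0.26, 0.24, 0.25, 0.22, 0.21, 0.23, 0.20]
print(direction_sign_test(within, between))
```

In this invented example the two series move in opposite directions at every step, so all 10 comparisons disagree and z is strongly negative.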

In addition, it appears from these observations that development may go in more than one direction. In this case, populations may not be homogeneous but may have a complex sub-structure instead. From these considerations, it appears that aggregates may reflect sub-population mix rather than development.
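The claim that aggregates may reflect sub-population mix rather than development can be illustrated with simple arithmetic. In this hypothetical Python sketch, two sub-groups keep exactly the same right-answer proportions, yet the aggregate proportion rises simply because the mix shifts toward the higher-scoring sub-group. The sub-group sizes and proportions are invented for illustration.

```python
def aggregate_proportion(group_sizes, group_proportions):
    """Size-weighted mean of sub-group proportions."""
    total = sum(group_sizes)
    return sum(n * p for n, p in zip(group_sizes, group_proportions)) / total


# Hypothetical right-answer proportions, identical at both times.
proportions = [0.30, 0.80]

before = aggregate_proportion([50, 50], proportions)  # even mix
after = aggregate_proportion([20, 80], proportions)   # mix shifted
print(f"aggregate before: {before:.2f}, after: {after:.2f}")
```

This prints `aggregate before: 0.55, after: 0.70`: an apparent gain of 15 points in the aggregate with no change at all within either sub-group.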

In the replication study,* it was clear that sub-population mix was important, since the selection proportions had to be changed in the pattern fit. Also, the pattern from the suburban group of 1975 had to be raised in average age level by a full year to fit the community cross-section sample of 1977-'78. Since the two studies were an average of 30 months apart, both drawn from the same community, the replication would seem to have been achieved through an averaging of the developmental dynamics and the sub-population mix. I would probably not have achieved as good a replication had differing communities been used, or had a longer time span elapsed. Whatever strange coincidence of natural events occurred to produce these same configurations (although at different selection levels) for all four alternatives between these two samples, we may never know.

This much seems to be clear, however: the aggregates seem to be more sensitive to cross-sectional events from the sub-population mix, and the internal events in each item seem to be more sensitive to the longitudinal events related to the development of cognition and achievement. It is entirely possible that the use of the scoring procedure derived from Associationist theory has been removing from our data set the information we were seeking, before we even began our data analysis!

In keeping with good culinary tradition, I have saved the best for last.

Out of
Nowhere

* The use of r² to determine explained variance for fitted curves gave an explained variance in excess of .60 for each of the four curves separately. I do not know how to obtain a block fit of all four curves as a unit from these results. Perhaps the replication was as good as .80 explained variance.



We now turn our attention to the pattern of departure from homogeneity within item 18, which produced a finding so startling as to all but confound the senses. Figure 7 gives the basic observations.



INSERT FIGURE 7 ABOUT HERE 



There is nothing particularly out of the ordinary in this Figure, except that 1? of the 30 tables are non-homogeneous. This observation leaves little doubt that complex curvilinearity is present.
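Judging whether a pre/post table is "non-homogeneous" rests on a test of this kind. The sketch below computes the Pearson chi-square statistic and its degrees of freedom for an r × c table using only the standard library; it assumes the usual independence model, and the example counts are hypothetical. Where SciPy is available, `scipy.stats.chi2_contingency` returns the same statistic together with a p-value (its continuity correction applies only to tables with one degree of freedom).

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c
    table, testing independence of row and column classifications."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Hypothetical 3 x 3 pre/post table.
table = [
    [10, 18, 2],
    [2, 10, 18],
    [18, 2, 10],
]
chi2, df = pearson_chi_square(table)
# 9.49 is the tabled 5% critical value of chi-square for 4 degrees of freedom.
print(f"chi-square = {chi2:.1f}, df = {df}, non-homogeneous at .05: {chi2 > 9.49}")
```

Repeating this for each of the 30 age-level tables and counting the significant results gives a tally of the kind reported above.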

It is also evident (visually) that the major influence on the size of the departure from homogeneity is related to group size, which I indicate with the dotted line in the background. The next question would reasonably be: what would the pattern be if the impact of this aggregate were removed? Figure 8 gives this result, taken in two steps.



INSERT FIGURE 8 ABOUT HERE 



In part A of Figure 8, I used the simple expedient of dividing each value by its corresponding group size. This step seemed to generate what appeared to be a cyclical pattern. I assumed a two year interval, for a reason I will indicate in a moment, sketched such a pattern as background, and inserted the center-line. It appeared from this procedure that the linear transformation of dividing by the group size over-compensated for its effect.
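Dividing a chi-square value by its group size, as in part A of Figure 8, yields the mean square contingency φ² = χ²/N. The short sketch below, with hypothetical counts, shows why this is a natural first correction: multiplying every cell of a table by the same factor multiplies χ² by that factor but leaves φ² unchanged. It does not reproduce the further rescaling described next; Cramér's V, √(φ²/min(r − 1, c − 1)), is included only as a standard bounded alternative.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic for an r x c table (independence model)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum((obs - r * c / n) ** 2 / (r * c / n)
               for r, row in zip(row_totals, table)
               for c, obs in zip(col_totals, row))


def phi_squared(table):
    """Mean square contingency: chi-square divided by the group size N."""
    return pearson_chi_square(table) / sum(map(sum, table))


def cramers_v(table):
    """Cramér's V, which rescales phi-squared to lie between 0 and 1."""
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(phi_squared(table) / k)


small = [[10, 18, 2], [2, 10, 18], [18, 2, 10]]  # hypothetical, N = 90
large = [[3 * x for x in row] for row in small]  # same proportions, N = 270

print(pearson_chi_square(small), pearson_chi_square(large))  # chi-square triples
print(phi_squared(small), phi_squared(large))                # phi-squared does not
```

If real departures from homogeneity do not scale exactly in proportion to N, dividing by N will over- or under-correct, which is consistent with the over-compensation noted above.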

Rather than trying to determine the exact transformation (perhaps a square root or a logarithm), I simply rescaled to make the center-line straight. Part B of Figure 8 gives the results. In the hypothetical cyclical pattern extracted by "removing" the impact of group size, the "eleventh year spurt" is clearly evident. There is also the possibility of some sort of second order oscillation, since the oscillations seem to increase in magnitude to the right in two separate stages.
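The text does not specify how the center-line was straightened. As a stand-in, not the author's method, the sketch below removes a center-line by subtracting a centered moving average taken over roughly one cycle; with 5 month age blocks and the assumed two year cycle, a window of 5 blocks (25 months) is used. A cycle that sums to zero over the window survives intact, while a linear trend is removed. The series and window length are hypothetical.

```python
def centered_moving_average(values, window):
    """Centered moving average; None where the window does not fit."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    out = []
    for i in range(len(values)):
        if i < half or i >= len(values) - half:
            out.append(None)
        else:
            out.append(sum(values[i - half:i + half + 1]) / window)
    return out


def remove_center_line(values, window=5):
    """Deviations from the centered moving average (the 'center-line')."""
    center = centered_moving_average(values, window)
    return [None if c is None else v - c for v, c in zip(values, center)]


# Hypothetical series: a rising trend plus a cycle five blocks long.
cycle = [0.0, 1.0, 0.5, -0.5, -1.0]
series = [0.2 * i + cycle[i % 5] for i in range(15)]
print(remove_center_line(series))
```

The two points at each end are left undefined because a full window does not fit there; any real cycle that is not exactly five blocks long will be only partly preserved.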

This pattern would not be particularly remarkable, and in point of fact could be "noise" in the system, had I not found a very similar pattern from an independent source. We must remember that this pattern has had the impact of the marginal frequencies removed when homogeneity was being determined. As a result, if a similar pattern is found using the marginal frequencies, then this pattern can NOT be noise.

In the study of which this present one is an extension, I reported an unusual observation. To begin with, the replication results pointed to the possibility that the irregular variations about the regression lines may not be "noise," as is typically assumed. To explore this possibility, I assumed these patterns to be multi-modal. I had a student (Alison Caird) tabulate the frequencies of the primary and secondary modes for both the right and the wrong answer selection proportions for all 40 items and for all but the very lowest and highest age levels (where the group sizes were too small to be representative of the cohort). These counts reflect the marginal proportions, a factor which is removed when considering the degree of non-homogeneity. Figure 9 shows these results with the same cyclic phases superimposed.
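Tabulating primary and secondary modes can be sketched as follows, under assumptions the text does not state: a "mode" is taken to be an interior age level whose selection proportion is higher than at both neighbouring levels, the highest such peak is the primary mode, and the next highest is the secondary mode. Plateaus and end points are not counted. The item series and function names are hypothetical.

```python
from collections import Counter


def local_maxima(values):
    """Indices of interior points strictly higher than both neighbours."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]


def primary_secondary_modes(values):
    """Indices of the highest and second-highest peaks (if present)."""
    peaks = sorted(local_maxima(values), key=lambda i: values[i], reverse=True)
    return peaks[:2]


def tally_modes(series_by_item):
    """Count, per age-level index, how often it is a primary or secondary mode."""
    primary, secondary = Counter(), Counter()
    for values in series_by_item:
        modes = primary_secondary_modes(values)
        if modes:
            primary[modes[0]] += 1
        if len(modes) > 1:
            secondary[modes[1]] += 1
    return primary, secondary


# Hypothetical selection proportions for two items over five age levels.
items = [
    [0.10, 0.50, 0.20, 0.40, 0.30],
    [0.20, 0.30, 0.60, 0.10, 0.20],
]
print(tally_modes(items))
```

Plotting the resulting counts against age level, and then looking for peaks in those counts, gives a "modes-of-modes" curve of the kind shown in Figure 9.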



INSERT FIGURE 9 ABOUT HERE



The heavy line is the pattern of the modes for the right answers, and the dotted line is the same pattern for the wrong answers, made more similar in appearance by using a 2 to 1 rather than a 3 to 1 scale ratio on the ordinate. The surprise I had in the earlier study (Powell, 1980) was that the wrong answers lagged the right answer cycle by about 5 months rather than being contra-cyclic, as would be expected. The surprise this time was that if we consider the right and wrong cycles as a single pattern, then the non-homogeneity pattern is closely in the opposite phase to the modes-of-modes pattern for half of the cycles. Assuming, as seems reasonable, that a fifth level of modes would most likely peak at age 11 for the right answers and at 19 for the wrong answers, this pattern may resolve even more closely. In the modes-of-modes pattern, we once again see a tendency for the oscillations to increase in amplitude to the right.
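Whether one cycle lags another, or runs in opposite phase, can be checked by correlating the two series at a range of offsets. The Python sketch below is an illustration under stated assumptions, not the procedure of the earlier study: it uses Pearson correlation, considers only non-negative lags, and expresses the lag in age blocks, so with 5 month blocks a best lag of one block would correspond to a lag of about 5 months. A strongly negative correlation at lag zero would instead indicate a contra-cyclic pattern. The series and function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(leader, follower, lag):
    """Correlation of leader[t] with follower[t + lag], for lag >= 0."""
    return pearson(leader[:len(leader) - lag], follower[lag:])


def best_lag(leader, follower, max_lag):
    """Lag (in blocks) at which the follower best tracks the leader."""
    return max(range(max_lag + 1),
               key=lambda k: lagged_correlation(leader, follower, k))


# Hypothetical cycles, one value per 5 month age block.
right = [0, 1, 0, -1] * 4
wrong = right[-1:] + right[:-1]       # the same cycle delayed by one block
print(best_lag(right, wrong, 3))       # lag in blocks
print(lagged_correlation(right, wrong, 0))
```

Here the delayed copy is recovered at a lag of one block, while the lag-zero correlation is zero rather than negative, which is the signature of a lag rather than of contra-cyclic movement.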

It appears that these "independent" sources mutually reinforce the evidence of non-linearity among these data. The cyclic patterns seem to support Piaget's phase and stage "clinical" model, except that rather than converging upon "formal operations" as a unifying entity, learners seem to DIVERGE AS THEY LEARN TO THINK.

The END of 
an ERA 

I began this discourse by suggesting that multiple-choice tests and our current practice for scoring them arose from early Associationist theory, then in the forefront of psychological thinking about learning. These inventions ushered in a new era in educational testing.

I also suggested how the twin concepts of "mental connections" and "trial and error" combined into the "know-guess" hypothesis to lead directly to current scoring practice. The mathematical translation of this latter hypothesis produced a two step procedure. In step one, the "right" answers were scored "one" as a "good association" and the "wrong" answers were scored "zero" as a "guess." Some of the right answers would be "lucky guesses," so particular answers could not be interpreted.

The second step counted the number of "right" answers to form a "total-correct" score, which was assumed to reflect how much the learner "knew." The total was sometimes modified to remove the "lucky guesses." Hence current scoring practice.
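The two-step procedure just described can be written out in a few lines. This Python sketch is illustrative: the answer key, the learners' responses and the function names are hypothetical, and the "lucky guess" adjustment shown is the conventional correction for guessing, R − W/(k − 1) for k alternatives, which is one common form of the modification mentioned above rather than necessarily the one any particular test used. The example also shows the information loss argued throughout this paper: two learners who chose different wrong answers receive identical item scores.

```python
def item_scores(responses, key):
    """Step one: 1 for the keyed ("right") answer, 0 for anything else."""
    return [1 if r == k else 0 for r, k in zip(responses, key)]


def total_correct(responses, key):
    """Step two: the "total-correct" score."""
    return sum(item_scores(responses, key))


def corrected_for_guessing(responses, key, n_alternatives):
    """Conventional correction: rights minus wrongs / (k - 1).
    Omitted items (None) count as neither right nor wrong."""
    rights = total_correct(responses, key)
    wrongs = sum(1 for r, k in zip(responses, key) if r is not None and r != k)
    return rights - wrongs / (n_alternatives - 1)


key = ["B", "A", "D", "C"]
learner_1 = ["B", "C", "D", "A"]   # hypothetical responses
learner_2 = ["B", "D", "D", "B"]   # different wrong answers, same rights

print(item_scores(learner_1, key), item_scores(learner_2, key))
print(total_correct(learner_1, key), total_correct(learner_2, key))
print(corrected_for_guessing(learner_1, key, 4))
```

Both learners score [1, 0, 1, 0] and a total of 2; whatever distinguished their choices of "C" and "A" versus "D" and "B" is gone before any further analysis begins.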

If either or both of these procedures are invalid, then using the General Linear Model (GLM) AFTER tests are scored with these procedures would invalidate the results from the GLM. I then presented evidence which demonstrates both procedures to be invalid.

The invalidation on psychological grounds was based upon the observation that the major contributor to wrong answer selection was item interpretation, which frequently leads to LOGICALLY CORRECT answers being considered wrong. Diagnostic and other information is also present among these "wrong" answers. Few such answers are "blind guesses."

Statistical disconfirmation, which required the development of a new procedure for the interpretation of contingency tables, depended upon several considerations. The psychological pattern for development, which suggested that learners moved from one wrong answer to another before reaching the "right" one, was derived statistically from these data using these new procedures. Information about learners not available from the right answers thus becomes available.

Bock's (1972) study, showing little gain from wrong answers at the Knowledge level, together with these results, which show considerable gain from wrong answers at the Comprehension level (with other evidence showing even more gain at the Analysis level), seems to suggest that present procedures may be appropriate for test items requiring low level skills like recognition or recall. The observed increase in diversity as thinking skills increase suggests that these procedures are misapplied to higher order tests. Other evidence suggests that even with low level skills, the current scoring procedure may be misapplied when the discrimination level of the test and the performance levels of the learners do not match. Both of these shortcomings can be substantially reduced by the simple expedient of considering all answers in our attempts at interpretation.

It appears that much useful information about learners and the learning process, including answers which are LOGICALLY CORRECT, may be lost by scoring the "wrong" answers as "zero."

From replication attempts, from internal dynamics analysis, and from within-group/between-group comparisons, it appears that the internal dynamics of an item, when fully disaggregated, reflect the longitudinal and other developmental properties of learning and achievement, while aggregation of these data seems to reflect the cross-sectional properties of sub-population mix.

Current educational research seems to show that the cross-sectional properties of groups overwhelm the longitudinal properties inherent in these data sets. Perhaps, however, the scoring procedure has been systematically removing these longitudinal properties from these data before the analysis. As just one example, research to date has seemed to show little advantage favouring one approach to teaching over another. Once population diversity has been controlled, however, the internal dynamics approach may show considerable differential effects of a sub-group specific nature.

A final nail in the coffin of current scoring practice was driven home when systematic curvilinearity at a second level, and perhaps even a third level, was found and independently supported (to a degree) among these data. With so much curvilinearity, current procedures (including the GLH) are not appropriate until valid transformations for these data have been found and can be employed.

We are indeed at the end of an ERA: not of multiple-choice testing, for these tests are more powerful than experience seemed to show, but of the use of a scoring procedure which has served us less well than we have thought these past 60 years.

The implications of this research can be summed up in one sentence: all of the educational research which has used the present scoring procedure BEFORE commencing other analyses will need to be reworked.
Vaya con Dios (go with God). Rest in Peace!
TABLE 1

DISTRIBUTION OF SUBJECTS IN THIS STUDY BY AGE LEVEL AND THE TIME OF ADMINISTRATION OF THE TEST

AGE     AGE IN       AGE IN   OCTOBER    MARCH      OVERLAP     GRAND
LEVEL   MONTHS       YEARS    ADMIN.     ADMIN.     OCT./MAR.   TOTALS

1       AIM < 96     -        53         3          25          56
2       96-100       8        68         32                     100
3       101-105               70         55         50          125
4       106-110      9        130        53         102         183
5       111-115               101        120        62          221
6       116-120      10       127        78         95          205
7       121-125               137        101        100         238
8       126-130               152        115        102         256
9       131-135      11       155        118        110         263
10      136-140               ?          ?          ?           255
11      141-145      12       165        106        119         271
12      146-150               135        131        90          266
13      151-155               138        100        88          238
14      156-160      13       152        105        99          256
15      161-165               115        132        73          246
16      166-170      14       163        101        110         264
17      171-175               262        150        189         412
18      176-180      15       ?          ?          195         501
19      181-185               258        257        201         505
20      186-190               251        255        177         506
21      191-195      16       259        228        162         577
22      196-200               219        220        145         539
23      201-205      17       210        219        117         529
24      206-210               ?          173        88          355
25      211-215               186        130        85          316
26      216-220      18       125        131        58          251
27      221-225               ?          ?          ?           168
28      226-230      19       ?          66         18          113
29      231-235      20       29         ?          9           63
30      235 < AIM             10         14         5           25

Annotation beside age levels 2 to 4: MOSTLY SAME GROUP.



FIGURE 1

AN EXAMPLE OF THE PSYCHOLOGICAL BASES FOR ANSWER SELECTION

Proverb: QUICKLY COME, QUICKLY GO. (EASY COME, EASY GO.)

a. ALWAYS COMING AND GOING AND NEVER SATISFIED.
   Age of most common choice: 13
   Reported reasoning: You should stick to a job until it's finished.

b. WHAT YOU GET EASILY DOES NOT MEAN MUCH TO YOU.
   Age of most common choice: Adult
   Reported reasoning: Keyed as the RIGHT answer.

c. ALWAYS DO THINGS ON TIME.
   Age of most common choice: 8
   Reported reasoning: That's what a teacher always says.

d. MOST PEOPLE DO AS THEY PLEASE AND GO AS THEY PLEASE.
   Age of most common choice: 1?
   Reported reasoning: It talks about coming and going.

Source: Item 18 from The Proverbs Test by Donald R. Gorham, Missoula, Montana: Psychological Test Specialists, 1956. Reproduced with permission.
FIGURE 2

STABILITY PATTERNS AMONG ANSWER SELECTIONS FOR ITEM 18

[Chart: age in years (8 to 19) on the horizontal axis. Key: p ≤ .10; p ≤ .05; p ≤ .01; possible developmental pattern(s).]

FIGURE 3

AN EXAMPLE OF A PROCEDURE TO OBTAIN HIGHER ORDER RELATIONSHIPS FROM CONTINGENCY TABLES

BASIC DATA: Item 18, age level 10 (136-140 months)

Observed frequencies (rows: post-test choice; columns: pre-test choice; * = keyed answer)

                PRE
POST        A     B*    C     D     TOTALS    %
A           9     4     3     13    29        30
B*          5     13    4     8     30        31
C           3     1     4     3     11        12
D           16    0     3     7     26        27
TOTALS      33    18    14    31    96        100

OVERALL χ² = 29.0; p < .005; df = 9

"Doctored" table with main diagonal removed

                PRE
POST        A     B*    C     D     TOTALS
A           -     4     3     13    20
B*          5     -     4     8     17
C           3     1     -     3     7
D           16    0     3     -     19
TOTALS      24    5     10    24    63

OVERALL χ² = 21.2; p < .005

NOTES: a. p ≤ .10; b. p ≤ .05; c. p ≤ .01



FIGURE 4

WITHIN-ITEM DEVELOPMENTAL PATTERN FROM CONTINGENCY-TABLE ANALYSIS

[Chart: age in years (8 to 19) on the horizontal axis. Key: p ≤ .10; p ≤ .05; p ≤ .01.]
FIGURE 5

DYNAMICS OF WITHIN-GROUP CHANGES USING MARGINAL PROPORTIONS

[Chart: age in years (8 to 19) on the horizontal axis.]

NOTE: This pattern has been simplified from the original vector pathways.
FIGURE 6

DYNAMICS OF BETWEEN-GROUP CHANGES USING AVERAGE MARGINAL PROPORTIONS




FIGURE 8

EXPLORATION FOR A PATTERN BY REMOVING THE DOMINANT EFFECT

[Chart: age in years (8 to 19) on the horizontal axis. Part B: cycle with straight center line. Key: chi square; cycle phase; fitted curve; trend line; center line.]
FIGURE 9

MATCHING CYCLIC PATTERNS FROM TWO INDEPENDENT SOURCES

PART A: Modes-of-modes from all right and all wrong answers with cyclic phase and averaging pattern

[Chart: age on the horizontal axis; vertical scales f(r) (10 to 15) and f(w) (20 to 30). Key: cycles compared; right answers; wrong answers; average curve; previous curve; trend lines.]

NOTE: Trend lines taken from previous curve.